GEM-PRO - List of Gene IDs¶
This notebook gives an example of how to run the GEM-PRO pipeline with a list of gene IDs.
Imports¶
In [1]:
import sys
import logging
In [2]:
# Import the GEM-PRO class
from ssbio.pipeline.gempro import GEMPRO
In [3]:
# Printing multiple outputs per cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
Logging¶
Set the logging level in logger.setLevel(logging.<LEVEL_HERE>)
to
specify how verbose you want the pipeline to be. Debug is most verbose.
CRITICAL
- Only really important messages shown
ERROR
- Major errors
WARNING
- Warnings that don’t affect running of the pipeline
INFO
(default)- Info such as the number of structures mapped per gene
DEBUG
- Really detailed information that will print out a lot of stuff
DEBUG
mode prints out a large amount of information,
especially if you have a lot of genes. This may stall your notebook!
In [4]:
# Create logger
logger = logging.getLogger()
logger.setLevel(logging.INFO) # SET YOUR LOGGING LEVEL HERE #
In [5]:
# Other logger stuff for Jupyter notebooks
handler = logging.StreamHandler(sys.stderr)
formatter = logging.Formatter('[%(asctime)s] [%(name)s] %(levelname)s: %(message)s', datefmt="%Y-%m-%d %H:%M")
handler.setFormatter(formatter)
logger.handlers = [handler]
Initialization of the project¶
Set these three things:
ROOT_DIR
- The directory where a folder named after your
PROJECT
will be created
- The directory where a folder named after your
PROJECT
- Your project name
LIST_OF_GENES
- Your list of gene IDs
A directory will be created in ROOT_DIR
with your PROJECT
name.
The folders are organized like so:
ROOT_DIR
└── PROJECT
├── data # General storage for pipeline outputs
├── model # SBML and GEM-PRO models are stored here
├── genes # Per gene information
│ ├── <gene_id1> # Specific gene directory
│ │ └── protein
│ │ ├── sequences # Protein sequence files, alignments, etc.
│ │ └── structures # Protein structure files, calculations, etc.
│ └── <gene_id2>
│ └── protein
│ ├── sequences
│ └── structures
├── reactions # Per reaction information
│ └── <reaction_id1> # Specific reaction directory
│ └── complex
│ └── structures # Protein complex files
└── metabolites # Per metabolite information
└── <metabolite_id1> # Specific metabolite directory
└── chemical
└── structures # Metabolite 2D and 3D structure files
In [6]:
# SET FOLDERS AND DATA HERE
import tempfile
ROOT_DIR = tempfile.gettempdir()
PROJECT = 'genes_GP'
LIST_OF_GENES = ['b0761', 'b0889', 'b0995', 'b1013', 'b1014', 'b1040', 'b1130', 'b1187', 'b1221', 'b1299']
PDB_FILE_TYPE = 'mmtf'
In [7]:
# Create the GEM-PRO project
my_gempro = GEMPRO(gem_name=PROJECT, root_dir=ROOT_DIR, genes_list=LIST_OF_GENES, pdb_file_type=PDB_FILE_TYPE)
[2018-02-05 18:12] [ssbio.pipeline.gempro] INFO: Creating GEM-PRO project directory in folder /tmp
[2018-02-05 18:12] [ssbio.pipeline.gempro] INFO: /tmp/genes_GP: GEM-PRO project location
[2018-02-05 18:12] [ssbio.pipeline.gempro] INFO: 10: number of genes
Mapping gene ID –> sequence¶
First, we need to map these IDs to their protein sequences. There are 2 ID mapping services provided to do this - through KEGG or UniProt. The end goal is to map a UniProt ID to each ID, since there is a comprehensive mapping (and some useful APIs) between UniProt and the PDB.
Methods¶
-
GEMPRO.
kegg_mapping_and_metadata
(kegg_organism_code, custom_gene_mapping=None, outdir=None, set_as_representative=False, force_rerun=False)[source] Map all genes in the model to KEGG IDs using the KEGG service.
- Steps:
- Download all metadata and sequence files in the sequences directory
- Creates a KEGGProp object in the protein.sequences attribute
- Returns a Pandas DataFrame of mapping results
Parameters: - kegg_organism_code (str) – The three letter KEGG code of your organism
- custom_gene_mapping (dict) – If your model genes differ from the gene IDs you want to map, custom_gene_mapping allows you to input a dictionary which maps model gene IDs to new ones. Dictionary keys must match model gene IDs.
- outdir (str) – Path to output directory of downloaded files, must be set if GEM-PRO directories were not created initially
- set_as_representative (bool) – If mapped KEGG IDs should be set as representative sequences
- force_rerun (bool) – If you want to overwrite any existing mappings and files
In [8]:
# KEGG mapping of gene ids
my_gempro.kegg_mapping_and_metadata(kegg_organism_code='eco')
print('Missing KEGG mapping: ', my_gempro.missing_kegg_mapping)
my_gempro.df_kegg_metadata.head()
[2018-02-05 18:12] [ssbio.pipeline.gempro] INFO: 10/10: number of genes mapped to KEGG
[2018-02-05 18:12] [ssbio.pipeline.gempro] INFO: Completed ID mapping --> KEGG. See the "df_kegg_metadata" attribute for a summary dataframe.
Missing KEGG mapping: []
Out[8]:
kegg | refseq | uniprot | pdbs | sequence_file | metadata_file | |
---|---|---|---|---|---|---|
gene | ||||||
b0761 | eco:b0761 | NP_415282 | P0A9G8 | 1B9M;1H9S;1B9N;1O7L;1H9R | eco-b0761.faa | eco-b0761.kegg |
b0889 | eco:b0889 | NP_415409 | P0ACJ0 | 2GQQ;2L4A | eco-b0889.faa | eco-b0889.kegg |
b0995 | eco:b0995 | NP_415515 | P38684 | 1ZGZ | eco-b0995.faa | eco-b0995.kegg |
b1013 | eco:b1013 | NP_415533 | P0ACU2 | 4JYK;4XK4;4X1E;3LOC | eco-b1013.faa | eco-b1013.kegg |
b1014 | eco:b1014 | NP_415534 | P09546 | 3E2Q;4JNZ;3E2R;4JNY;2GPE;4O8A;3E2S;2FZN;1TJ1;1... | eco-b1014.faa | eco-b1014.kegg |
-
GEMPRO.
uniprot_mapping_and_metadata
(model_gene_source, custom_gene_mapping=None, outdir=None, set_as_representative=False, force_rerun=False)[source] Map all genes in the model to UniProt IDs using the UniProt mapping service. Also download all metadata and sequences.
Parameters: - model_gene_source (str) –
the database source of your model gene IDs. See: http://www.uniprot.org/help/api_idmapping Common model gene sources are:
- Ensembl Genomes -
ENSEMBLGENOME_ID
(i.e. E. coli b-numbers) - Entrez Gene (GeneID) -
P_ENTREZGENEID
- RefSeq Protein -
P_REFSEQ_AC
- Ensembl Genomes -
- custom_gene_mapping (dict) – If your model genes differ from the gene IDs you want to map, custom_gene_mapping allows you to input a dictionary which maps model gene IDs to new ones. Dictionary keys must match model genes.
- outdir (str) – Path to output directory of downloaded files, must be set if GEM-PRO directories were not created initially
- set_as_representative (bool) – If mapped UniProt IDs should be set as representative sequences
- force_rerun (bool) – If you want to overwrite any existing mappings and files
- model_gene_source (str) –
In [9]:
# UniProt mapping
my_gempro.uniprot_mapping_and_metadata(model_gene_source='ENSEMBLGENOME_ID')
print('Missing UniProt mapping: ', my_gempro.missing_uniprot_mapping)
my_gempro.df_uniprot_metadata.head()
[2018-02-05 18:12] [root] INFO: getUserAgent: Begin
[2018-02-05 18:12] [root] INFO: getUserAgent: user_agent: EBI-Sample-Client/ (services.py; Python 3.6.3; Linux) Python-requests/2.18.4
[2018-02-05 18:12] [root] INFO: getUserAgent: End
[2018-02-05 18:12] [root] WARNING: status is not ok with Bad Request
[2018-02-05 18:12] [root] WARNING: Results seems empty...returning empty dictionary.
[2018-02-05 18:12] [ssbio.pipeline.gempro] INFO: 0/10: number of genes mapped to UniProt
[2018-02-05 18:12] [ssbio.pipeline.gempro] INFO: Completed ID mapping --> UniProt. See the "df_uniprot_metadata" attribute for a summary dataframe.
Missing UniProt mapping: ['b1130', 'b0889', 'b1221', 'b0761', 'b0995', 'b1187', 'b1013', 'b1299', 'b1014', 'b1040']
[2018-02-05 18:12] [ssbio.pipeline.gempro] WARNING: Empty dataframe
Out[9]:
uniprot | reviewed | gene_name | kegg | refseq | num_pdbs | pdbs | ec_number | pfam | seq_len | description | entry_date | entry_version | seq_date | seq_version | sequence_file | metadata_file | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
gene |
-
GEMPRO.
set_representative_sequence
(force_rerun=False)[source] Automatically consolidate loaded sequences (manual, UniProt, or KEGG) and set a single representative sequence.
Manually set representative sequences override all existing mappings. UniProt mappings override KEGG mappings except when KEGG mappings have PDBs associated with them and UniProt doesn’t.
Parameters: force_rerun (bool) – Set to True to recheck stored sequences
In [10]:
# Set representative sequences
my_gempro.set_representative_sequence()
print('Missing a representative sequence: ', my_gempro.missing_representative_sequence)
my_gempro.df_representative_sequences.head()
[2018-02-05 18:12] [ssbio.pipeline.gempro] INFO: 10/10: number of genes with a representative sequence
[2018-02-05 18:12] [ssbio.pipeline.gempro] INFO: See the "df_representative_sequences" attribute for a summary dataframe.
Missing a representative sequence: []
Out[10]:
uniprot | kegg | pdbs | sequence_file | metadata_file | |
---|---|---|---|---|---|
gene | |||||
b0761 | P0A9G8 | eco:b0761 | 1B9M;1H9S;1B9N;1O7L;1H9R | eco-b0761.faa | eco-b0761.kegg |
b0889 | P0ACJ0 | eco:b0889 | 2GQQ;2L4A | eco-b0889.faa | eco-b0889.kegg |
b0995 | P38684 | eco:b0995 | 1ZGZ | eco-b0995.faa | eco-b0995.kegg |
b1013 | P0ACU2 | eco:b1013 | 4JYK;4XK4;4X1E;3LOC | eco-b1013.faa | eco-b1013.kegg |
b1014 | P09546 | eco:b1014 | 3E2Q;4JNZ;3E2R;4JNY;2GPE;4O8A;3E2S;2FZN;1TJ1;1... | eco-b1014.faa | eco-b1014.kegg |
Mapping representative sequence –> structure¶
These are the ways to map sequence to structure:
- Use the UniProt ID and their automatic mappings to the PDB
- BLAST the sequence to the PDB
- Make homology models or
- Map to existing homology models
You can only utilize option #1 to map to PDBs if there is a mapped UniProt ID set in the representative sequence. If not, you’ll have to BLAST your sequence to the PDB or make a homology model. You can also run both for maximum coverage.
Methods¶
-
GEMPRO.
map_uniprot_to_pdb
(seq_ident_cutoff=0.0, outdir=None, force_rerun=False)[source] Map all representative sequences’ UniProt ID to PDB IDs using the PDBe “Best Structures” API. Will save a JSON file of the results to each protein’s
sequences
folder.The “Best structures” API is available at https://www.ebi.ac.uk/pdbe/api/doc/sifts.html The list of PDB structures mapping to a UniProt accession sorted by coverage of the protein and, if the same, resolution.
Parameters: - seq_ident_cutoff (float) – Sequence identity cutoff in decimal form
- outdir (str) – Output directory to cache JSON results of search
- force_rerun (bool) – Force re-downloading of JSON results if they already exist
Returns: A rank-ordered list of PDBProp objects that map to the UniProt ID
Return type: list
In [11]:
# Mapping using the PDBe best_structures service
my_gempro.map_uniprot_to_pdb(seq_ident_cutoff=.3)
my_gempro.df_pdb_ranking.head()
[2018-02-05 18:12] [ssbio.pipeline.gempro] INFO: Mapping UniProt IDs --> PDB IDs...
[2018-02-05 18:12] [root] INFO: getUserAgent: Begin
[2018-02-05 18:12] [root] INFO: getUserAgent: user_agent: EBI-Sample-Client/ (services.py; Python 3.6.3; Linux) Python-requests/2.18.4
[2018-02-05 18:12] [root] INFO: getUserAgent: End
[2018-02-05 18:12] [root] WARNING: status is not ok with Bad Request
[2018-02-05 18:12] [root] WARNING: Results seems empty...returning empty dictionary.
[2018-02-05 18:12] [ssbio.pipeline.gempro] INFO: 0/10: number of genes with at least one experimental structure
[2018-02-05 18:12] [ssbio.pipeline.gempro] INFO: Completed UniProt --> best PDB mapping. See the "df_pdb_ranking" attribute for a summary dataframe.
[2018-02-05 18:12] [ssbio.pipeline.gempro] WARNING: Empty dataframe
Out[11]:
-
GEMPRO.
blast_seqs_to_pdb
(seq_ident_cutoff=0, evalue=0.0001, all_genes=False, display_link=False, outdir=None, force_rerun=False)[source] BLAST each representative protein sequence to the PDB. Saves raw BLAST results (XML files).
Parameters: - seq_ident_cutoff (float, optional) – Cutoff results based on percent coverage (in decimal form)
- evalue (float, optional) – Cutoff for the E-value - filters for significant hits. 0.001 is liberal, 0.0001 is stringent (default).
- all_genes (bool) – If all genes should be BLASTed, or only those without any structures currently mapped
- display_link (bool, optional) – Set to True if links to the HTML results should be displayed
- outdir (str) – Path to output directory of downloaded files, must be set if GEM-PRO directories were not created initially
- force_rerun (bool, optional) – If existing BLAST results should not be used, set to True. Default is False
In [12]:
# Mapping using BLAST
my_gempro.blast_seqs_to_pdb(all_genes=True, seq_ident_cutoff=.9, evalue=0.00001)
my_gempro.df_pdb_blast.head(2)
[2018-02-05 18:12] [ssbio.pipeline.gempro] INFO: Completed sequence --> PDB BLAST. See the "df_pdb_blast" attribute for a summary dataframe.
[2018-02-05 18:12] [ssbio.pipeline.gempro] INFO: 5: number of genes with additional structures added from BLAST
Out[12]:
pdb_id | pdb_chain_id | hit_score | hit_evalue | hit_percent_similar | hit_percent_ident | hit_num_ident | hit_num_similar | |
---|---|---|---|---|---|---|---|---|
gene | ||||||||
b0761 | 1b9n | B | 1091.0 | 5.530720e-119 | 0.931298 | 0.931298 | 244 | 244 |
b0761 | 1o7l | D | 1089.0 | 1.096280e-118 | 0.931298 | 0.931298 | 244 | 244 |
Downloading and ranking structures¶
Methods¶
-
GEMPRO.
pdb_downloader_and_metadata
(outdir=None, pdb_file_type=None, force_rerun=False)[source] Download ALL mapped experimental structures to each protein’s structures directory.
Parameters: - outdir (str) – Path to output directory, if GEM-PRO directories were not set or other output directory is desired
- pdb_file_type (str) – Type of PDB file to download, if not already set or other format is desired
- force_rerun (bool) – If files should be re-downloaded if they already exist
In [13]:
# Download all mapped PDBs and gather the metadata
my_gempro.pdb_downloader_and_metadata()
my_gempro.df_pdb_metadata.head(2)
[2018-02-05 18:12] [ssbio.pipeline.gempro] INFO: Updated PDB metadata dataframe. See the "df_pdb_metadata" attribute for a summary dataframe.
[2018-02-05 18:12] [ssbio.pipeline.gempro] INFO: Saved 15 structures total
Out[13]:
chemicals | description | experimental_method | mapped_chains | pdb_id | pdb_title | resolution | structure_file | taxonomy_name | |
---|---|---|---|---|---|---|---|---|---|
gene | |||||||||
b0761 | NI | ModE (MOLYBDATE-DEPENDENT TRANSCRIPTIONAL REGU... | X-RAY DIFFRACTION | A;B | 1b9m | REGULATOR FROM ESCHERICHIA COLI | 1.75 | 1b9m.mmtf | Escherichia coli |
b0761 | NI | MODE (MOLYBDATE DEPENDENT TRANSCRIPTIONAL REGU... | X-RAY DIFFRACTION | A;B | 1b9n | REGULATOR FROM ESCHERICHIA COLI | 2.09 | 1b9n.mmtf | Escherichia coli |
-
GEMPRO.
set_representative_structure
(seq_outdir=None, struct_outdir=None, pdb_file_type=None, engine=’needle’, always_use_homology=False, rez_cutoff=0.0, seq_ident_cutoff=0.5, allow_missing_on_termini=0.2, allow_mutants=True, allow_deletions=False, allow_insertions=False, allow_unresolved=True, skip_large_structures=False, clean=True, force_rerun=False)[source] Set all representative structure for proteins from a structure in the structures attribute.
Each gene can have a combination of the following, which will be analyzed to set a representative structure.
- Homology model(s)
- Ranked PDBs
- BLASTed PDBs
If the
always_use_homology
flag is true, homology models are always set as representative when they exist. If there are multiple homology models, we rank by the percent sequence coverage.Parameters: - seq_outdir (str) – Path to output directory of sequence alignment files, must be set if GEM-PRO directories were not created initially
- struct_outdir (str) – Path to output directory of structure files, must be set if GEM-PRO directories were not created initially
- pdb_file_type (str) –
pdb
,mmCif
,xml
,mmtf
- file type for files downloaded from the PDB - engine (str) –
biopython
orneedle
- which pairwise alignment program to use.needle
is the standard EMBOSS tool to run pairwise alignments.biopython
is Biopython’s implementation of needle. Results can differ! - always_use_homology (bool) – If homology models should always be set as the representative structure
- rez_cutoff (float) – Resolution cutoff, in Angstroms (only if experimental structure)
- seq_ident_cutoff (float) – Percent sequence identity cutoff, in decimal form
- allow_missing_on_termini (float) – Percentage of the total length of the reference sequence which will be ignored when checking for modifications. Example: if 0.1, and reference sequence is 100 AA, then only residues 5 to 95 will be checked for modifications.
- allow_mutants (bool) – If mutations should be allowed or checked for
- allow_deletions (bool) – If deletions should be allowed or checked for
- allow_insertions (bool) – If insertions should be allowed or checked for
- allow_unresolved (bool) – If unresolved residues should be allowed or checked for
- clean (bool) – If structures should be cleaned
- force_rerun (bool) – If sequence to structure alignment should be rerun
In [14]:
# Set representative structures
my_gempro.set_representative_structure()
my_gempro.df_representative_structures.head()
[2018-02-05 18:12] [ssbio.pipeline.gempro] INFO: 5/10: number of genes with a representative structure
[2018-02-05 18:12] [ssbio.pipeline.gempro] INFO: See the "df_representative_structures" attribute for a summary dataframe.
Out[14]:
id | is_experimental | file_type | structure_file | |
---|---|---|---|---|
gene | ||||
b0761 | REP-1b9n | True | pdb | 1b9n-A_clean.pdb |
b0889 | REP-2gqq | True | pdb | 2gqq-A_clean.pdb |
b1013 | REP-4xk4 | True | pdb | 4xk4-A_clean.pdb |
b1187 | REP-1h9t | True | pdb | 1h9t-A_clean.pdb |
b1221 | REP-1rnl | True | pdb | 1rnl-A_clean.pdb |
In [15]:
# Looking at the information saved within a gene
my_gempro.genes.get_by_id('b1187').protein.representative_structure
my_gempro.genes.get_by_id('b1187').protein.representative_structure.get_dict()
Out[15]:
<StructProp REP-1h9t at 0x7f3880a275c0>
Out[15]:
{'_structure_dir': '/tmp/genes_GP/genes/b1187/b1187_protein/structures',
'chains': [<ChainProp A at 0x7f38830e1c18>],
'date': None,
'description': 'FATTY ACID METABOLISM REGULATOR PROTEIN',
'file_type': 'pdb',
'id': 'REP-1h9t',
'is_experimental': True,
'mapped_chains': ['A'],
'notes': {},
'original_structure_id': '1h9t',
'resolution': 3.25,
'structure_file': '1h9t-A_clean.pdb',
'taxonomy_name': 'ESCHERICHIA COLI'}
Saving your GEM-PRO¶
-
GEMPRO.
save_json
(outfile, compression=False) Save the object as a JSON file using json_tricks
In [16]:
import os.path as op
my_gempro.save_json(op.join(my_gempro.model_dir, '{}.json'.format(my_gempro.id)), compression=False)
[2018-02-05 18:12] [root] WARNING: json-tricks: numpy scalar serialization is experimental and may work differently in future versions
[2018-02-05 18:12] [ssbio.io] INFO: Saved <class 'ssbio.pipeline.gempro.GEMPRO'> (id: genes_GP) to /tmp/genes_GP/model/genes_GP.json